The PySpark API offers multiple ways of performing aggregation. Aggregating usually shuffles data between partitions. This shuffling is needed to compute the result correctly, but it has an associated cost that can impact performance, since it moves data over the network between Spark tasks.
There are, however, cases where some aggregation methods are more efficient than others. For example, when RDD.groupByKey is used in conjunction with RDD.mapValues and the function passed to RDD.mapValues reduces the grouped values with a commutative and associative operation, it is preferable to use RDD.reduceByKey instead. The performance gain from RDD.reduceByKey comes from the amount of data that needs to be moved between PySpark tasks: RDD.reduceByKey reduces the number of rows in each partition before sending the data over the network for further reduction. With RDD.groupByKey followed by RDD.mapValues, on the other hand, the reduction is only done after the data has been moved around the cluster, slowing down the computation by transferring a larger amount of data over the network.
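As an illustration, here is a minimal sketch contrasting the two approaches; the local SparkContext and the sample key-value pairs are assumptions made for the example:

```python
from operator import add

from pyspark import SparkContext

sc = SparkContext("local[*]", "aggregation-example")  # illustrative local context

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)])

# groupByKey + mapValues: all values for each key are shuffled across the
# network first, and only then summed on the receiving side.
grouped_sums = pairs.groupByKey().mapValues(sum)

# reduceByKey: values are summed within each partition first, so only one
# partial sum per key per partition is shuffled for the final reduction.
reduced_sums = pairs.reduceByKey(add)

# Both produce the same per-key sums; only the shuffle volume differs.
assert sorted(grouped_sums.collect()) == sorted(reduced_sums.collect())
print(sorted(reduced_sums.collect()))  # [('a', 9), ('b', 6)]
```

Because addition is commutative and associative, the partial sums computed within each partition by RDD.reduceByKey can be combined in any order after the shuffle without changing the result.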